In this notebook, I analyze Ford GoBike system data for 2019. The dataset includes station and trip information for three cities: San Francisco, San Jose and Oakland. The data is published monthly as a zipped CSV file. I am interested in how people behave towards the bike-sharing system: how they use it and for what purpose. I defined nine research questions and answered each of them with one or more data visualizations. A summary of the answers is given in the conclusion section.
Because the trip dataset has about 2.5 million records, some of the code that works on it can take up to 3 minutes to execute.
The cleaned data is imported in this section. The main motivation is to understand people's bike-riding behaviour, such as when, where and why they use the bikes. These are the main questions this project tries to answer:
- Is there any relationship between the day of the week and trip duration, distance and speed?
- How are the stations distributed geographically?
- At what time of day do people use bikes most frequently?
- How did bike usage change across the months?
- How do different users (subscribers or casual customers) use bikes?
- How are subscribers' and casual users' trips distributed in each city?
- Are there any suspicious activities?
- Is there any travel between cities?
- How much gas has been saved by the Ford GoBike system?
Short answers to all the questions are provided in the Conclusion section.
The number of trips is considerably higher on weekdays than at the weekend, which may indicate that people use bikes to commute to work.
The mean trip duration is about 3 minutes longer at weekends.
The mean trip distance is slightly shorter at weekends.
The mean trip speed is about 1.5 km/h lower at weekends.
As the following graphs show, riding behaviour differs markedly between weekdays and weekends. On weekdays people make more trips and ride faster: they cover slightly longer distances in less time. This suggests that weekday trips are more purposeful, probably commutes between home and work, while weekend trips are more for pleasure.
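The weekday versus weekend comparison can also be expressed as a small summary table. The following is a minimal sketch on a hypothetical six-trip sample; the column names `start_weekday`, `duration_sec`, `distance` and `speed` follow the notebook, while the values, the `day_type` column and the assumption that `distance` is in meters are for illustration only.

```python
import pandas as pd

# Hypothetical sample; start_weekday uses 0 = Monday ... 6 = Sunday.
sample = pd.DataFrame({
    "start_weekday": [0, 1, 2, 5, 6, 6],
    "duration_sec": [600, 540, 660, 780, 840, 720],
    "distance": [1700, 1600, 1800, 1500, 1600, 1400],  # meters (assumed unit)
})
# Speed in km/h derived from meters and seconds.
sample["speed"] = sample["distance"] / 1000 / (sample["duration_sec"] / 3600)

# Label each trip as weekday or weekend, then compare counts and means.
sample["day_type"] = sample["start_weekday"].map(
    lambda d: "weekend" if d >= 5 else "weekday"
)
summary = sample.groupby("day_type").agg(
    trips=("duration_sec", "size"),
    mean_duration_min=("duration_sec", lambda s: s.mean() / 60),
    mean_distance_km=("distance", lambda s: s.mean() / 1000),
    mean_speed_kmh=("speed", "mean"),
)
print(summary)
```

On the real data the same grouping could be applied to the trip table directly, which gives the numbers behind the four bar charts in one place.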
# Trip frequency during weekday and weekends
plt.figure(figsize=[18,15])
plt.subplot(2,2,1)
base_color = sb.color_palette()[0]
sb.countplot(data=df_trips, x = 'start_weekday', color=base_color);
plt.xlabel('Week day(Start of the trips)',fontsize=12)
plt.ylabel('Trip count',fontsize=12)
plt.title('Frequency of trips in week days',fontsize=14)
plt.yticks(np.arange(0, 450000, 50000), ['0', '50K', '100K','150K','200K','250K','300K','350K','400K']);
n_points=df_trips.shape[0]
locs, labels = plt.xticks()
for loc, label in zip(locs, labels):
    # Count the trips that started on this weekday and print the count inside the bar
    count = len(df_trips[df_trips['start_weekday'] == int(label.get_text())])
    pct_string = '{:0.1f}K'.format(count/1000)
    plt.text(loc, count-15000, pct_string, ha='center', color='w' )
plt.xticks([0,1,2,3,4,5,6],['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']);
plt.grid(True, axis='y')
plt.rcParams['axes.axisbelow'] = True
plt.subplot(2,2,2);
data = df_trips.groupby('start_weekday')['duration_sec'].mean().reset_index()
plt.bar(data=data, x = 'start_weekday', height='duration_sec')
plt.xticks(np.arange(0,7,1), ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
plt.yticks(np.arange(0,901,60),np.arange(0,16,1))
plt.title('Mean of trip duration for each week day', fontsize = 14)
plt.ylabel('Mean trip duration (mins)', fontsize = 12)
plt.xlabel('Week days', fontsize = 12)
plt.grid(True, axis='y')
plt.rcParams['axes.axisbelow'] = True
plt.subplot(2,2,3);
data = df.groupby('start_weekday')['distance'].mean().reset_index()
plt.bar(data=data, x = 'start_weekday', height='distance')
plt.xticks(np.arange(0,7,1), ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
plt.yticks(np.arange(0,1601,200),['0','0.2km','0.4km','0.6km','0.8km','1.0km','1.2km','1.4km','1.6km'])
plt.title('Mean of trip distance for each week day', fontsize = 14)
plt.ylabel('Mean trip distance (km)', fontsize = 12)
plt.xlabel('Week days', fontsize = 12)
plt.grid(True, axis='y')
plt.rcParams['axes.axisbelow'] = True
plt.subplot(2,2,4);
data = df.groupby('start_weekday')['speed'].mean().reset_index()
plt.bar(data=data, x = 'start_weekday', height='speed')
plt.xticks(np.arange(0,7,1), ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
#plt.yticks(np.arange(0,1601,200),['0','0.2km','0.4km','0.6km','0.8km','1.0km','1.2km','1.4km','1.6km'])
plt.title('Mean of trip speed for each week day', fontsize = 14)
plt.ylabel('Mean trip speed (km/h)', fontsize = 12)
plt.xlabel('Week days', fontsize = 12)
plt.grid(True, axis='y', which='both')
plt.rcParams['axes.axisbelow'] = True
San Francisco has more bike-sharing stations than Oakland and San Jose: San Francisco has 230 stations, Oakland 132 and San Jose 94. San Jose is the most populous of the three cities, yet it has the fewest stations. The following table relates the number of stations to population:
| City | Population | Bike stations per 100,000 people |
|---|---|---|
| San Jose | 1,035,000 | 9.08 |
| San Francisco | 884,000 | 26 |
| Oakland | 425,000 | 31 |
Oakland has the most bike stations per 100,000 people.
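The per-capita figures follow from dividing the station count by the population and scaling to 100,000 people. A minimal sketch using the station counts and populations quoted above (the dictionary names are illustrative):

```python
# Station counts and populations as quoted in the text.
stations = {"San Jose": 94, "San Francisco": 230, "Oakland": 132}
population = {"San Jose": 1_035_000, "San Francisco": 884_000, "Oakland": 425_000}

# Stations per 100,000 people for each city.
per_100k = {
    city: stations[city] / population[city] * 100_000
    for city in stations
}

for city, value in sorted(per_100k.items(), key=lambda kv: kv[1]):
    print(f"{city}: {value:.2f} stations per 100,000 people")
```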
Trips are generally longer in San Francisco. Trips in Oakland and San Jose take about 10 minutes on average, compared with about 12 minutes in San Francisco.
The mean trip distance in San Francisco is about 250 meters longer than in Oakland and about 600 meters longer than in San Jose.
People in Oakland ride faster than those in San Francisco and San Jose.
# City analysis
plt.figure(figsize=[18,15])
plt.subplot(2,2,1)
base_color = sb.color_palette()[9]
sb.countplot(data=df_stations, x = 'city', color=base_color);
plt.title('Number of stations in each city', fontsize = 14);
plt.ylabel('Number of stations',fontsize = 12);
plt.xlabel('City',fontsize = 12);
locs, labels = plt.xticks()
for loc, label in zip(locs, labels):
    count = len(df_stations[df_stations['city'] == label.get_text()])
    plt.text(loc, count-10, count, ha='center', color='w' )
plt.grid(True, axis='y')
plt.rcParams['axes.axisbelow'] = True
plt.subplot(2,2,2)
base_color = sb.color_palette()[9]
data = df.groupby('city_x')['duration_sec'].mean().reset_index()
plt.bar(data=data, x = 'city_x',height='duration_sec', color=base_color);
plt.title('Mean trip duration for each city', fontsize = 14)
plt.ylabel('Mean Trip duration (mins)',fontsize = 12);
plt.xlabel('City',fontsize = 12);
plt.yticks(np.arange(0,721,60), np.arange(0,13,1));
plt.grid(True, axis='y')
plt.rcParams['axes.axisbelow'] = True
plt.subplot(2,2,3)
base_color = sb.color_palette()[9]
data = df.groupby('city_x')['distance'].mean().reset_index()
plt.bar(data=data, x = 'city_x',height='distance', color=base_color);
plt.title('Mean trip distance for each city', fontsize = 14)
plt.ylabel('Mean Trip distance (km)',fontsize = 12);
plt.xlabel('City',fontsize = 12);
plt.grid(True, axis='y')
plt.rcParams['axes.axisbelow'] = True
plt.subplot(2,2,4)
base_color = sb.color_palette()[9]
data = df.groupby('city_x')['speed'].mean().reset_index()
plt.bar(data=data, x = 'city_x',height='speed', color=base_color);
plt.title('Mean trip speed for each city', fontsize = 14)
plt.ylabel('Mean Trip speed (km/h)',fontsize = 12);
plt.xlabel('City',fontsize = 12);
plt.grid(True, axis='y')
plt.rcParams['axes.axisbelow'] = True
The peak times for bike usage are 7 to 9 am and 4 to 6 pm. These are the hours when people usually commute, which is another indication that people use bikes to travel between home and work.
Mean of trip durations:
The mean trip duration is almost the same at different hours of the day.
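The hourly trip counts and mean durations can be obtained with a single group-by on the hour of `start_time`. This is a minimal sketch on hypothetical timestamps; only the column names `start_time` and `duration_sec` come from the notebook, and the `trips` table here is a stand-in for the real trip data.

```python
import pandas as pd

# Hypothetical trips: three in the morning peak, two in the evening peak.
trips = pd.DataFrame({
    "start_time": pd.to_datetime([
        "2019-03-04 08:05", "2019-03-04 08:40", "2019-03-04 08:55",
        "2019-03-04 12:10", "2019-03-04 17:15", "2019-03-04 17:45",
        "2019-03-04 22:30",
    ]),
    "duration_sec": [620, 580, 640, 700, 600, 610, 590],
})

# Number of trips and mean duration for each start hour.
hourly = (
    trips.groupby(trips["start_time"].dt.hour)["duration_sec"]
    .agg(["size", "mean"])
    .rename(columns={"size": "trips", "mean": "mean_duration_sec"})
)

# The two busiest hours in the sample.
peak_hours = hourly["trips"].nlargest(2).index.tolist()
print(hourly)
print("Peak hours:", peak_hours)
```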
plt.figure(figsize=[18,15])
plt.subplot(2,1,1)
base_color = sb.color_palette()[9]
data = df_trips['start_time'].dt.hour.reset_index()
sb.countplot(data=data, x='start_time', color=base_color);
plt.title('Trip frequency in every hour of the day', fontsize = 14)
plt.xlabel('Hour', fontsize=12);
plt.ylabel('Trip count', fontsize=12);
locs, labels = plt.xticks()
for loc, label in zip(locs, labels):
    count = len(data[data['start_time'] == int(label.get_text())])
    plt.text(loc, count+1500,str(round(count/1000,1)) +'K', ha='center', color='black')
plt.yticks(np.arange(0,300001, 50000), ['0','50K','100K','150K','200K','250K','300K']);
plt.subplot(2,1,2);
data = df_trips.groupby(df_trips['start_time'].dt.hour)['duration_sec'].mean().reset_index()
plt.bar(data=data, x = 'start_time', height='duration_sec')
plt.xticks(np.arange(0,24,1));
plt.yticks(np.arange(0,750,60), np.arange(0,13,1));
plt.title('Mean trip duration for each hour', fontsize = 14)
plt.xlabel('Hours', fontsize=12)
plt.ylabel('Mean trip duration (mins)', fontsize=12)
plt.grid(True, axis='y')
plt.rcParams['axes.axisbelow'] = True
plt.figure(figsize=[18,15])
plt.subplot(2,1,1);
data = df.groupby(df['start_time'].dt.hour)['distance'].mean().reset_index()
plt.bar(data=data, x = 'start_time', height='distance')
plt.xticks(np.arange(0,24,1));
plt.title('Mean trip distance for each hour', fontsize = 14)
plt.xlabel('Hours', fontsize=12)
plt.ylabel('Mean trip distance (km)', fontsize=12)
plt.grid(True, axis='y')
plt.rcParams['axes.axisbelow'] = True
plt.subplot(2,1,2);
data = df.groupby(df['start_time'].dt.hour)['speed'].mean().reset_index()
plt.bar(data=data, x = 'start_time', height='speed')
plt.xticks(np.arange(0,24,1));
plt.title('Mean trip speed for each hour', fontsize = 14)
plt.xlabel('Hours', fontsize=12)
plt.ylabel('Mean trip speed (km/h)', fontsize=12)
plt.grid(True, axis='y')
plt.rcParams['axes.axisbelow'] = True
The number of trips is highest in March, April and October. January, February, November and December are the rainiest months in this region, while May, June, July and August are the hottest months with direct sun. It seems people choose a bike when the weather is neither too rainy nor too sunny.
If all the trips had been made by car, they would have used 445,910 liters of gas (assuming a car uses 10 liters per 100 km). The GoBike system saved this fuel, which is very valuable. Even at a gas price as low as 1 USD per liter, the system saved 445,910 dollars in 2019.
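The gas estimate is a direct conversion from total distance to fuel. A minimal sketch of that arithmetic, assuming 10 liters per 100 km and 1 USD per liter as in the text; the function name and the total distance of about 4.46 million km are illustrative, the latter back-calculated from the quoted 445,910 liters.

```python
def gas_saved_liters(total_distance_m, liters_per_100km=10.0):
    """Fuel a car would have used to cover the same distance."""
    distance_km = total_distance_m / 1000
    return distance_km * liters_per_100km / 100


# Illustrative total: about 4.46 million km, expressed in meters.
total_distance_m = 4_459_100_000
liters = gas_saved_liters(total_distance_m)

# Assumed price of 1 USD per liter.
price_per_liter_usd = 1.0
savings_usd = liters * price_per_liter_usd

print(f"Gas saved: {liters:,.0f} liters, worth {savings_usd:,.0f} USD")
```

A different fuel consumption or price can be tried by changing `liters_per_100km` or `price_per_liter_usd`; the estimate scales linearly with both.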
plt.figure(figsize = [15,30]);
plt.subplot(4,1,1)
base_color = sb.color_palette()[0]
data = df_trips['start_time'].dt.month.reset_index()
plt.title('Trip frequency in every month', fontsize = 14)
sb.countplot(data=data, x='start_time', color=base_color);
locs, labels = plt.xticks()
for loc, label in zip(locs, labels):
    count = len(data[data['start_time'] == int(label.get_text())])
    plt.text(loc, count-7500,str(round(count/1000,1)) +'K', ha='center', color='w' )
# countplot places the 12 months at positions 0..11, so 12 tick positions match the 12 labels
plt.xticks(np.arange(0,12,1),['Jan','Feb','March','Apr','May','Jun','July','Aug','Sep','Oct','Nov', 'Dec']);
plt.yticks(np.arange(0,300000, 50000), ['0','50K','100K','150K','200K','250K']);
plt.xlabel('Month', fontsize=12);
plt.ylabel('Trip count', fontsize=12);
plt.grid(True, axis='y')
plt.rcParams['axes.axisbelow'] = True
data = df[['distance','start_time']].groupby(df['start_time'].dt.month).sum().reset_index()
# Convert distance from meters to km
data['distance'] = data['distance']/1000
# Assume a car uses 10 liters of gas per 100 km
data['gas_saved'] = data['distance']/10
fig = plt.figure(figsize=(15,8))
ax = fig.add_axes([.125, .125, .775, .755])
ax.bar(data = data, height='gas_saved', x='start_time');
ax.ticklabel_format(style='plain')
plt.xticks(np.arange(1,13,1),['Jan','Feb','March','Apr','May','Jun','July','Aug','Sep','Oct','Nov', 'Dec']);
plt.xlabel('Month', fontsize = 12)
plt.ylabel('Gas saved (liters)', fontsize = 12)
plt.yticks(np.arange(0,50001,5000))
plt.title('Gas saved per month (liters)',fontsize = 14)
plt.grid(True, axis='y')
plt.rcParams['axes.axisbelow'] = True
total = round(data['gas_saved'].sum()/1000,2)
plt.text(x=4, y=50000, s='Total gas saved in 2019: '+str(total)+ 'k liters', fontsize=15, color='red')
Text(4, 50000, 'Total gas saved in 2019: 443.98k liters')
80.63% of the users are Subscribers and 19.37% of them are casual customers.
data = df_trips.user_type.value_counts()
plt.pie(data, labels = data.index, startangle=90, counterclock=False, autopct='%.2f');
plt.axis('square');
- The data contains a small number of trips longer than 2 hours, most of which have a very low speed (0.44 km/h on average). This suggests that the bike was idle for most of these trips. I removed these trips from further analysis.
- Most of the trips (about 900k) last between 5 and 10 minutes. Assuming an average speed of 20 km/h, these trips cover roughly 1.7 to 3.3 km.
- About 550k trips last between 10 and 15 minutes, which corresponds to roughly 3.3 to 5 km.
- The number of trips shorter than 5 minutes is considerable (400k). It shows that people use bikes for short times and probably short distances, which can justify placing stations close together.
- The number of trips falls as duration increases, to the point that there are only 3k trips lasting about one hour and 600 lasting about two hours.
- Overall, the histogram shows that people use bikes for relatively short times.
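The distances in the list above come from multiplying the assumed average speed by the trip duration. A minimal sketch of that conversion, taking the 20 km/h figure from the text as given (the function name is illustrative):

```python
def distance_km(duration_min, speed_kmh=20.0):
    """Distance covered at a constant speed, with duration given in minutes."""
    return speed_kmh * duration_min / 60


# Rough distances at the edges of the duration bins discussed above.
for minutes in (5, 10, 15):
    print(f"{minutes} min at 20 km/h is about {distance_km(minutes):.2f} km")
```

These are rough estimates only, since real trips include stops and the average speed varies by city and time of day.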
#Trip duration histogram for the entire data
plt.figure(figsize=[15,6])
bins= np.arange(0, 24*3600+1, 900)
plt.hist(data=df, x='duration_sec', bins=bins);
plt.xticks(np.arange(0, 24*3600+1, 3600),np.arange(0, 24, 1));
plt.yticks(np.arange(0, 1500001, 500000),['0','500K','1000K','1500K']);
plt.grid(True, axis='y')
plt.rcParams['axes.axisbelow'] = True
plt.title('Frequency of trips based on trip duration (bin size = 15 minutes)', fontsize = 14)
plt.xlabel('Trip duration (hour)', fontsize =12)
plt.ylabel('Frequency (log)', fontsize = 12)
speed = round(df.query('duration_sec > 2*3600')['speed'].mean(),2)
plt.yscale('log')
plt.text(x=23500,y=2000000, s='Mean speed for trips durations more than 2 hours is '+str(speed)+' km/h',fontsize = 12, color='red')
# Trip duration histogram
plt.figure(figsize=[15,6])
bins= np.arange(0, 2*3600+1, 300)
plt.hist(data=df_trips, x='duration_sec', bins=bins);
plt.xticks(np.arange(0, 2*3600+1, 300),np.arange(0, 121, 5));
plt.yticks(np.arange(0, 900001, 200000),['0','200K','400K','600K','800K']);
plt.title('Frequency of trips based on trip duration (bin size = 5 minutes) for trips less than 2 hours', fontsize = 14)
plt.xlabel('Trip duration (minutes)', fontsize =12)
plt.ylabel('Frequency (log)', fontsize = 12)
plt.grid(True, axis='y')
plt.rcParams['axes.axisbelow'] = True
plt.yscale('log')
# Trip duration histogram for less than an hour and between one and two hours
plt.figure(figsize=[15,6])
plt.subplot(1,2,1)
data = df_trips.query('duration_sec <= 3600')
bins= np.arange(0, 3600+1, 300)
plt.hist(data=data, x='duration_sec', bins=bins);
plt.xticks(np.arange(0, 3601, 300),np.arange(0, 61, 5));
plt.yticks(np.arange(0, 900001, 200000),['0','200K','400K','600K','800K'])
plt.title('Frequency of trips for trips less than 1 hours', fontsize = 14)
plt.xlabel('Trip duration (minutes)', fontsize =12)
plt.ylabel('Frequency (log)', fontsize = 12)
plt.grid(True, axis='y')
plt.rcParams['axes.axisbelow'] = True
plt.yscale('log')
plt.subplot(1,2,2)
data = df_trips.query('duration_sec > 3600')
bins= np.arange(3600, 2*3600+1, 300)
plt.hist(data=data, x='duration_sec', bins=bins);
plt.xticks(np.arange(3600, 2*3600+1, 300),np.arange(60, 121, 5));
plt.title('Frequency of trips for trips between 1 and 2 hours', fontsize = 14)
plt.xlabel('Trip duration (minutes)', fontsize =12)
plt.ylabel('Frequency', fontsize = 12)
plt.grid(True, axis='y')
plt.rcParams['axes.axisbelow'] = True
We can call two types of activity suspicious:
- Trips covering a long distance in a short time (very fast trips): bikes may have been moved by car.
  - There is a trip with a speed of 70 km/h. The bike used in this trip was moved from San Francisco to San Jose and never had another trip afterwards, which is very suspicious.
- Trips with a very long duration (more than 10 hours).
  - There are 1967 trips with such a long duration. Most probably, the bikes were idle during these trips. One scenario is that the user did not disconnect from the bike, so the trip continued.
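Both kinds of suspicious trip can be flagged with simple boolean masks. The following is a minimal sketch on hypothetical data: the thresholds (speed above 40 km/h, duration above 10 hours) match the filters used in the plotting code, while the sample values and the `bike_id` and `suspicious` columns are illustrative.

```python
import pandas as pd

# Hypothetical trips; speed is in km/h and duration in seconds.
trips = pd.DataFrame({
    "bike_id": [11, 12, 13, 14],
    "duration_sec": [600, 300, 11 * 3600, 900],
    "speed": [14.0, 70.0, 0.3, 18.5],
})

# A bike is unlikely to exceed 40 km/h, so faster trips may be car transfers.
too_fast = trips["speed"] > 40
# Trips longer than 10 hours are most likely idle bikes that were never returned.
too_long = trips["duration_sec"] > 10 * 3600

trips["suspicious"] = too_fast | too_long
flagged = trips[trips["suspicious"]]
print(flagged)
```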
plt.figure(figsize=(15,6))
plt.subplot(1,2,1)
data = df.query('speed > 40')
sb.barplot(data = data , x=data.index, y='speed', color=sb.color_palette()[0],ci=None);
plt.tick_params(axis='x', rotation=45)
plt.title('Trips with speed higher than 40', fontsize = 14)
plt.xlabel('Trip Id', fontsize =12)
plt.ylabel('Speed', fontsize = 12)
plt.subplot(1,2,2)
data = df_full.query('duration_sec > 10*3600');
plt.hist( x=data['duration_sec']);
plt.xticks(np.arange(36001, data.duration_sec.max(), 3600),np.arange(10, 24, 1));
plt.yticks(np.arange(0, 301, 50),['0','50','100','150','200','250','300']);
plt.title('Frequency of trips for trips more than 10 hours', fontsize = 14)
plt.xlabel('Trip duration (hours)', fontsize =12)
plt.ylabel('Frequency', fontsize = 12)
sum10 = df_full.query('duration_sec > 10*3600').shape[0]
plt.text(x=45000,y=270, s='Number of all the trips with trip\nduration more than 10 hours: '+str(sum10),fontsize = 13, color='red')
Text(45000, 270, 'Number of all the trips with trip\nduration more than 10 hours: 1967')
Stations are distributed across three cities: San Francisco (230 stations), Oakland (132) and San Jose (94). You can zoom in on the map to see more details.
Most of the stations are located on the west side of the city. Stations in the city center are closer together. By clicking on the markers, you can see the name and id of the station.
#San Francisco
locations = df_stations.query('city=="San Francisco"')[['latitude','longitude','station_id','name']].reset_index()
locations['text'] ='Station Id: ' +locations['station_id'].apply(str)+' - Name: '+ locations['name']
markers = gmaps.marker_layer(locations[['latitude','longitude']], info_box_content = locations['text'])
fig = gmaps.figure(center=(37.77498, -122.419234), zoom_level=13.3, layout={'width': '900px','height': '700px','padding': '3px','border': '1px solid black'})
fig.add_layer(markers)
embed_minimal_html('FigureHTMLs/SanFranciscoStations.html', views=[fig])
fig
Stations are located mostly in the city center. There are no stations in the north part of the city as well as the Central Berkeley district. By clicking on the markers, you can see the name and id of the station.
#Oakland
locations = df_stations.query('city == "Oakland"')[['latitude','longitude','station_id','name']].reset_index()
locations['text'] ='Station Id: ' +locations['station_id'].apply(str)+' - Name: '+ locations['name']
markers = gmaps.marker_layer(locations[['latitude','longitude']], info_box_content = locations['text'])
fig = gmaps.figure(center=(37.82448, -122.271234), zoom_level=12.5, layout={'width': '900px','height': '700px','padding': '3px','border': '1px solid black'})
fig.add_layer(markers)
embed_minimal_html('FigureHTMLs/oaklandStations.html', views=[fig])
fig
Stations are located mostly in the city center. By clicking on the markers, you can see the name and id of the station.
# San Jose
locations = df_stations.query('city == "San Jose"')[['latitude','longitude','station_id','name']].reset_index()
locations['text'] ='Station Id: ' +locations['station_id'].apply(str)+' - Name: '+ locations['name']
markers = gmaps.marker_layer(locations[['latitude','longitude']], info_box_content = locations['text'])
fig = gmaps.figure(center=(37.34058, -121.890234), zoom_level=13.3, layout={'width': '900px','height': '700px','padding': '3px','border': '1px solid black'})
fig.add_layer(markers)
embed_minimal_html('FigureHTMLs/SanJoseStations.html', views=[fig])
fig
Subscribers tend to use bikes during peak hours more than casual customers do. The red line in the following figure shows the ratio of casual customers to subscribers. The ratio drops during peak hours, which suggests that subscribers use the bikes for commuting. In other words, people who commute by bike tend to subscribe.
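The ratio of casual customers to subscribers is the customer trip count divided by the subscriber trip count for each hour, expressed as a percentage. A minimal sketch on hypothetical data; it uses `pd.crosstab` rather than the notebook's `pivot_table` call, and the `hour` values and `rate_pct` column are illustrative.

```python
import pandas as pd

# Hypothetical trips: a peak hour (8) and an off-peak hour (13).
trips = pd.DataFrame({
    "hour": [8, 8, 8, 8, 8, 13, 13, 13],
    "user_type": [
        "Subscriber", "Subscriber", "Subscriber", "Subscriber", "Customer",
        "Subscriber", "Subscriber", "Customer",
    ],
})

# Trip counts per hour and user type.
counts = pd.crosstab(trips["hour"], trips["user_type"])

# Casual customers as a percentage of subscribers, per hour.
counts["rate_pct"] = (counts["Customer"] / counts["Subscriber"] * 100).round(2)
print(counts)
```

In this sample the ratio is lower in the peak hour than off-peak, which is the pattern the red line is meant to show.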
data = df_trips[['start_time','user_type']].copy()  # copy to avoid a SettingWithCopyWarning
data['hour'] = data['start_time'].dt.hour
data = data.pivot_table(index='hour', columns='user_type',aggfunc='count')
data['rate'] = round(data[('start_time', 'Customer')]/data[('start_time', 'Subscriber')] * 100 , 2)
data['rate(5000x)'] = data['rate']*5000
data['rate'] = data['rate'].apply(lambda x: '{0:.2f}%'.format(x))
data = data.reset_index()
data.columns = ['hour','Customer','Subscriber','rate', 'rate(5000x)']
fig1, ax = plt.subplots(figsize=[18,10])
data['rate(5000x)'].plot(linestyle='-', marker='o', color = '#E65100')
data[['hour','Customer','Subscriber']].plot(x='hour', kind='bar', ax =ax, color=['#FFB74D','#4FC3F7'])
plt.title("Number of trips by subscribers and casual customers in different hours of day", fontsize = 14, fontweight='bold')
plt.legend(['Rate of casual customer to subscriber','Customer','Subscriber'])
plt.yticks(np.arange(0,200001,50000), ['0','50k','100k','150k','200k'])
x = ax.get_xticks()
for a,b,c in zip(x,data['rate(5000x)'], np.array( data['rate'])):
    plt.text(a,b+1500,c, color='#616161', fontsize=12)
plt.grid(True, axis='y')
plt.rcParams['axes.axisbelow'] = True
On weekdays, subscribers make more trips than at weekends, while the number of trips by casual customers does not change. The rate of casual customers to subscribers doubles at weekends, which is another indication that most subscribers use bikes for commuting.
fig1, ax = plt.subplots(figsize=[18,10])
data['rate(5000x)'].plot(linestyle='-', marker='o', color = '#E65100')
data[['hour','Customer','Subscriber']].plot(x='hour', kind='bar', ax =ax, color=['#FFB74D','#4FC3F7'])
plt.title("Number of trips by subscribers and casual customers in days of week", fontsize = 14, fontweight='bold')
plt.legend(['Rate of casual customer to subscriber','Customer','Subscriber'])
plt.xticks([0,1,2,3,4,5,6],['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']);
plt.yticks(np.arange(0,350001,50000), ['0','50k','100k','150k','200k','250k','300k','350k'])
plt.ylabel('Number of trips', fontsize=12)
plt.xlabel('Day of week', fontsize=12)
x = ax.get_xticks()
for a,b,c in zip(x,data['rate(5000x)'], np.array( data['rate'])):
    plt.text(a,b+1500,c, color='#616161', fontsize=12)
plt.grid(True, axis='y')
plt.rcParams['axes.axisbelow'] = True
The rate of casual customers to subscribers is higher in San Francisco (26.53%) than in Oakland (20.34%) and San Jose (12.58%).
fig1, ax = plt.subplots(figsize=[18,10])
data['rate(20000x)'].plot(linestyle='-', marker='o', color = '#E65100')
data[['hour','Customer','Subscriber']].plot(x='hour', kind='bar', ax =ax, color=['#FFB74D','#4FC3F7'])
plt.title("Number of trips by subscribers and casual customers in cities", fontsize = 14, fontweight='bold')
plt.legend(['Rate of casual customer to subscriber','Customer','Subscriber'])
plt.yticks(np.arange(0,1600001,200000), ['0','200k','400k','600k','800k','1,000k','1,200k','1,400k','1,600K'])
x = ax.get_xticks()
for a,b,c in zip(x,data['rate(20000x)'], np.array( data['rate'])):
    plt.text(a,b+1500,c, color='#616161', fontsize=12)
plt.grid(True, axis='y')
plt.rcParams['axes.axisbelow'] = True
The following heatmap shows the amount of traffic for each station in San Francisco. Larger circles indicate more trips starting from the station. Stations in the west and center of the city have more trips, while some stations on the east and south sides have very little traffic.
#San Francisco
locations = df.query('city_x=="San Francisco"')[['latitude_x','longitude_x']].groupby(['latitude_x','longitude_x']).size().reset_index()
locations.columns = ['latitude_x', 'longitude_x', 'count']
weights = locations['count']
fig = gmaps.figure(center=(37.77498, -122.419234), zoom_level=13, layout={'width': '900px','height': '700px','padding': '3px','border': '1px solid black'})
heatmap_layer = gmaps.heatmap_layer(locations[['latitude_x','longitude_x']],weights )
heatmap_layer.max_intensity = 200
heatmap_layer.point_radius = 7
fig.add_layer(heatmap_layer)
embed_minimal_html('FigureHTMLs/SanFranciscoTrips.html', views=[fig])
fig
The following heatmap shows the amount of traffic for each station in San Jose. Larger circles indicate more trips starting from the station. Stations in the downtown area have more trips, while some stations on the north and south sides of the city have very little traffic.
#San Jose
locations = df.query('city_x=="San Jose"')[['latitude_x','longitude_x']].groupby(['latitude_x','longitude_x']).size().reset_index()
locations.columns = ['latitude_x', 'longitude_x', 'count']
weights = locations['count']
fig = gmaps.figure(center=(37.34058, -121.890234), zoom_level=13.3, layout={'width': '900px','height': '700px','padding': '3px','border': '1px solid black'})
heatmap_layer = gmaps.heatmap_layer(locations[['latitude_x','longitude_x']],weights )
heatmap_layer.max_intensity = 200
heatmap_layer.point_radius = 7
fig.add_layer(heatmap_layer)
embed_minimal_html('FigureHTMLs/SanJoseTrips.html', views=[fig])
fig
The following heatmap shows the amount of traffic for each station in Oakland. Larger circles indicate more trips starting from the station. Stations in the downtown area have more trips, while some stations on the south side of the city have very little traffic.
#Oakland
locations = df.query('city_x=="Oakland"')[['latitude_x','longitude_x']].groupby(['latitude_x','longitude_x']).size().reset_index()
locations.columns = ['latitude_x', 'longitude_x', 'count']
weights = locations['count']
fig = gmaps.figure(center=(37.82448, -122.271234), zoom_level=12.7, layout={'width': '900px','height': '700px','padding': '3px','border': '1px solid black'})
heatmap_layer = gmaps.heatmap_layer(locations[['latitude_x','longitude_x']],weights )
heatmap_layer.max_intensity = 200
heatmap_layer.point_radius = 7
fig.add_layer(heatmap_layer)
embed_minimal_html('FigureHTMLs/oaklandTrips.html', views=[fig])
fig
The following scatter plot shows the mean duration and distance of trips for each day. Each circle represents a day, and the size of the circle shows the number of trips on that day. Each city is shown in a different colour. You can see the details of a day by hovering the mouse over a point. Generally, people in San Francisco use bikes for longer distances and longer times. In San Jose, trips are shorter than in the other two cities. There is an unusual data point on September 22nd in San Jose, where the trip duration is almost twice that of a regular day.
data = df.groupby([df['city_x'],df['start_time'].dt.date]).agg({'distance':'mean','duration_sec': 'mean', 'station_id_x': 'count', 'speed':'mean'}).reset_index()
data.columns = ['city', 'start_time', 'distance', 'duration', 'count','speed']
fig = px.scatter(data, y='distance',x='duration', color='city',size="count",hover_name="start_time")
fig.update_layout(
    title="Trip distance-duration scatter plot",
    # x holds 'duration' and y holds 'distance', so the axis titles follow that order
    xaxis_title="Duration(mins)",
    width=960,
    height=720,
    legend_orientation="h",
    legend=dict(x=0, y=1.05),
    yaxis_title="Distance(km)",
    font=dict(size=14,color="#7f7f7f"))
fig.show()
plotly.offline.init_notebook_mode(connected=True)
The following scatter plot shows the mean trip distance for each day of the year. Each circle represents a day, and the size of the circle shows the number of trips on that day. Each city is shown in a different colour. You can see the details of a day by hovering the mouse over a point.
fig = px.scatter(data, y='distance',x='start_time', color='city',size="count",hover_name="start_time")
fig.update_layout(
title="Trip distance-time scatter plot",
xaxis_title="Time",
width=960,
height=720,
legend_orientation="h",
legend=dict(x=0, y=1.05),
yaxis_title="Distance(km)",
font=dict(size=14,color="#7f7f7f"))
fig.show()
The following scatter plot shows the mean trip duration for each day of the year. Each circle represents a day, and the size of the circle shows the number of trips on that day. Each city is shown in a different colour. You can see the details of a day by hovering the mouse over a point.
fig = px.scatter(data, y='duration',x='start_time', color='city',size="count",hover_name="start_time")
fig.update_layout(
title="Trip duration-time scatter plot",
xaxis_title="Time",
width=960,
height=720,
legend_orientation="h",
legend=dict(x=0, y=1.05),
yaxis_title="Duration(mins)",
font=dict(size=14,color="#7f7f7f"))
fig.show()
The following box plots show the distribution of trip duration for subscribers and casual customers in each city. I divided the data into four categories:
- Trips of 30 minutes or less.
- Between 30 minutes and one hour.
- Between one hour and one hour and 30 minutes.
- Between one hour and 30 minutes and two hours.
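The four categories can also be assigned in one step with `pd.cut` instead of four separate queries. A minimal sketch on hypothetical durations; the bin edges match the `duration_sec` thresholds used in the plotting code, while the label strings and sample values are illustrative.

```python
import pandas as pd

# Hypothetical trip durations in seconds, including values on the bin edges.
durations = pd.Series(
    [120, 1800, 1801, 3600, 4000, 5400, 6000, 7200], name="duration_sec"
)

# Right-closed bins: (0, 1800], (1800, 3600], (3600, 5400], (5400, 7200].
bins = [0, 1800, 3600, 5400, 7200]
labels = ["<= 30 min", "30-60 min", "60-90 min", "90-120 min"]
category = pd.cut(durations, bins=bins, labels=labels)

# Number of trips per category, in bin order.
counts = category.value_counts(sort=False)
print(counts)
```

Because the bins are right-closed, a trip of exactly 1800 seconds falls into the first category, matching the `duration_sec <= 1800` query. Durations above 7200 seconds fall outside all bins and become missing values.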
plt.figure(figsize=[15,15])
data = df.query('duration_sec <= 1800')
g = sb.FacetGrid(data = data, col ='user_type', height = 4)
g.map(sb.boxplot,'city_x', 'duration_sec',order=["San Francisco","Oakland","San Jose"] );
plt.yticks(np.arange(0,1801,300), np.arange(0,31,5));
data = df.query('duration_sec > 1800 and duration_sec <= 3600')
g = sb.FacetGrid(data = data, col ='user_type', height = 4)
g.map(sb.boxplot,'city_x', 'duration_sec',order=["San Francisco","Oakland","San Jose"] );
plt.yticks(np.arange(1800,2*3600+1,300), np.arange(30,121,5));
data = df.query('duration_sec > 3600 and duration_sec <= 5400')
g = sb.FacetGrid(data = data, col ='user_type', height = 4)
g.map(sb.boxplot,'city_x', 'duration_sec',order=["San Francisco","Oakland","San Jose"] );
plt.yticks(np.arange(1800,2*3600+1,300), np.arange(30,121,5));
data = df.query('duration_sec > 5400 and duration_sec <= 7200')
g = sb.FacetGrid(data = data, col ='user_type', height = 4)
g.map(sb.boxplot,'city_x', 'duration_sec',order=["San Francisco","Oakland","San Jose"]);
plt.yticks(np.arange(1800,2*3600+1,300), np.arange(30,121,5));
plt.figure(figsize=(12,10))
sb.heatmap(data = data, annot=True, cmap="YlGnBu");
plt.title('Travel between cities',fontsize=14)
plt.xlabel('Source City', fontsize=12);
plt.ylabel('Destination City',fontsize=12);
import matplotlib.patches as mpatches
G = nx.from_pandas_edgelist(data, 'source', 'destination', 'count')
plt.figure(3,figsize=(15,10))
pos = nx.spring_layout(G,k=0.5,iterations=30)
node_colors = ['#E57373' if node in oakland else '#4DB6AC' if node in sanJose else '#90CAF9' for node in G.nodes()]
edges = G.edges()
weights = [G[u][v]['count'] for u,v in edges]
# edge_labels and edges are not valid nx.draw arguments; pass the edges as edgelist instead
nx.draw(G,pos,with_labels=True,node_size=1500,width=weights, node_color=node_colors, edgelist=list(edges))
red_patch = mpatches.Patch(color='#E57373', label='Oakland')
green_patch = mpatches.Patch(color='#4DB6AC', label='San Jose')
blue_patch = mpatches.Patch(color='#90CAF9', label='San Francisco')
plt.legend(handles=[red_patch, blue_patch,green_patch]);
station_21 = (37.789625, -122.400811)
station_235 = (37.807239, -122.289370)
fig = gmaps.figure(center=(37.80498, -122.339234), zoom_level=13, layout={'width': '900px','height': '700px','padding': '3px','border': '1px solid black'})
# Route between station 21 and station 235
station_route = gmaps.directions_layer(station_21, station_235)
fig.add_layer(station_route)
fig